ABSTRACT

The study reports a feasibility study for using Item Response Theory (IRT) as a means of
equating the Test of Standard Written English (TSWE). The study focused on the possibility
of pre-equating, that is, deriving the equating transformation prior to the final
administration of the test. The three-parameter logistic model was postulated as the
response model and its fit assessed at the item, subscore, and total score levels. Minor
problems were found at each of these levels but, on the whole, the three-parameter model
was found to portray the data well. The adequacy of the equating provided by IRT procedures
was investigated in two TSWE forms. It was concluded that pre-equating does not appear to
present problems beyond those inherent to IRT-equating. (Author)



INTRODUCTION



Equating, in general, refers to the derivation of transformations that map scores on
different forms of a test onto a scale in such a way that after transformation the
scores on the various forms are comparable. The equating methodology that has been
commonly used (see Angoff, 1971) requires that the form being equated first be
administered to testees. Since in large-scale testing programs scores are not sent back
to testees for four to six weeks, it would seem that there is ample time to derive the
equating transformation. In practice the bulk of the time is consumed by various data
processing steps. As a result the equating transformation must be produced in a rather
short period of time. Even when no difficulties arise the psychometrician is under
considerable pressure.

From this pragmatic point of view, one of the most exciting applications of Item
Response Theory (IRT) is pre-equating (see Lord, 1980, Chapter 13). As implied by the
name, pre-equating refers to the derivation of the transformation prior to the
administration of the form to be equated. This requires that IRT item statistics be
available on a common metric for all the items that appear in the final form. The
feasibility of implementing pre-equating for the TSWE is the focus of the present
study.



Overview of the Study 

Whether pre-equating works or not depends on two broad factors. One is the fit of the
three-parameter model to TSWE data. Since there is no general procedure for ascertaining
fit, several procedures will be used in the hope that collectively they can be more
revealing. The second broad factor that may prevent successful pre-equating is lack of
"situational" invariance in the item parameter estimates. In practice, pre-equating
requires that the final form be assembled from items coming from various pretest forms.
This raises the possibility of a context effect on item parameters, which, as shown by
Yen (1980), can be substantial. The adequacy of pre-equating will be judged on two
forms in which these conditions could be simulated, using as a criterion scores equated
by means of non-IRT procedures.

The next section gives a brief description of the TSWE, as well as the data and
calibration procedure used in this study. The following two sections will examine the
fit and adequacy of pre-equating, respectively. Recommendations and suggestions for
further research will be discussed in the final section.



THE DATA

Description of the TSWE 

The TSWE is a 30-minute multiple choice test administered together with the SAT. Its
purpose is to help colleges place students in appropriate English Composition courses.
It is not recommended as an admission instrument. The test consists of 50 items; items
1-25 and 41-50 are called usage items. The testee is expected to recognize writing
that does not follow conventional and standard written English. An example of this
type of item is the following:

Directions: The following sentences contain problems in grammar, usage, diction
(choice of words), and idiom.

Some sentences are correct.
No sentence contains more than one error.

You will find that the error, if there is one, is underlined and lettered. Assume
that all other elements of the sentence are correct and cannot be changed. In
choosing answers, follow the requirements of standard written English.

If there is an error, select the one underlined part that must be changed in order
to make the sentence correct, and blacken the corresponding space on the answer
sheet.

If there is no error, mark answer space E.

EXAMPLES:

I. He spoke bluntly and angrily to we spectators. No error
   A B C D E

II. He works every day so that he would become financially
    A B C D
    independent in his old age. No error
    E

The other 15 items, 26-40, are called sentence correction. In these items, the
student is expected to recognize unacceptable usage and to choose the best way of
phrasing the sentence. An example of this type of item is the following:

Directions: In each of the following sentences, some part of the sentence or
the entire sentence is underlined. Beneath each sentence you will find five
ways of phrasing the underlined part. The first of these repeats the original;
the other four are different.

If you think the original is better than any of the alternatives, choose answer
A; otherwise, choose one of the others. Select the best version and blacken the
corresponding space on your answer sheet.

This is a test of correctness and effectiveness of expression. In choosing the
answer, follow the requirements of standard written English: that is, pay
attention to grammar, choice of words, sentence construction, and punctuation.
Choose the answer that produces the most effective sentence: clear and exact,
without awkwardness or ambiguity. Do not make a choice that changes the meaning
of the original sentence.

EXAMPLES:

I. Caroline is studying music because she has always wanted to become it.
   (A) it  (B) one of them  (C) a musician
   (D) one in music  (E) this

II. Because Mr. Thomas was angry, he spoke in a loud voice.
   (A) he spoke  (B) and speaking  (C) and he speaks
   (D) as he spoke  (E) he will be speaking

Research on the TSWE has shown it to be a reliable and valid instrument. Table 1
shows some sample statistics for forms E3-E8. As can be seen, standard errors of
measurement are about 4.0 and reliabilities are in the upper .80s. For a 30-minute
test these figures are satisfactory. Research by Breland (1976) has also provided
evidence of the construct validity of scores derived from the TSWE. For example, the
correlation between TSWE scores and essay scores is higher than the correlation
between SAT verbal scores and essay scores. This is to be expected if indeed the TSWE
measures writing ability rather than verbal ability.



TABLE 1. Item and Test Analysis Results for Various TSWE Forms

Form                  E3      E4      E5      E6      E7      E8

Admin. date        12/74    2/75    2/75    4/75   11/75   12/75
N                   1765    1920    1790    1685    1895    1830
Reliability         .890    .8?5    .872    .867    .893    .874
SEM (scaled)         3.7     3.8     4.1     4.0     3.6     4.2
Mean R-Bis.          .51     .49     .47     .46     .51     .49
Equated Δ mean       9.2     9.4     9.4     9.6     9.1     8.9

Due to the expense involved in item calibration the adequacy of pre-equating was
investigated for only two TSWE forms: E7 and E8. As we shall see, however, to obtain
item statistics on even two forms is not straightforward. The calibration of a large
set of items administered to different samples involves, first, obtaining item parameter
estimates on the arbitrary metric defined by each calibration sample, and second,
placing all items calibrated on different samples on the same metric.

Parameter estimation. All item parameter estimates used in this report were obtained
using the program LOGIST (Wood, Wingersky, and Lord, 1976). This means that the
three-parameter logistic model (Birnbaum, 1968) was the assumed response function.
The function of the LOGIST program is to estimate, for each item, the three item
parameters: a (discrimination); b (difficulty); and c (a pseudo-guessing parameter).
Unless otherwise indicated, the following constraints were imposed on the estimation:
a was restricted between .01 and 1.25; c was held fixed to .15 until stage 2 of
step 2. Thereafter a, b, and c are estimated, except that c's were held fixed at a
constant, c, estimated by the program for those items with b - 2/a < -2.0 at the end of
stage 3 of step 2 (Wood et al., 1976). The c's for all other items were restricted
to a range of .0 to .5.
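The constraints described above can be illustrated with a minimal sketch. It is not LOGIST code; the function names and the example item values are hypothetical, and only the numeric limits (.01 to 1.25 for a, .0 to .5 for c, and the b - 2/a < -2.0 rule) are taken from the text.

```python
# Illustrative sketch of the estimation constraints; names are hypothetical.

A_MIN, A_MAX = 0.01, 1.25   # range imposed on a
C_MIN, C_MAX = 0.0, 0.5     # range imposed on freely estimated c's


def c_is_held_common(a, b, threshold=-2.0):
    """True when the item's c is held at the common constant.

    The rule in the text is b - 2/a < -2.0: the item is so easy relative to
    its discrimination that the data say little about its lower asymptote.
    """
    return b - 2.0 / a < threshold


def clip(value, low, high):
    """Restrict a value to the closed interval [low, high]."""
    return max(low, min(high, value))


# Hypothetical items: (a, b)
easy_flat_item = (0.50, -1.00)    # -1.00 - 4.00 = -5.00, below -2.0
harder_item = (1.20, 0.50)        # 0.50 - 1.67 = -1.17, not below -2.0

print(c_is_held_common(*easy_flat_item))   # c held at the common constant
print(c_is_held_common(*harder_item))      # c estimated within [.0, .5]
print(clip(1.40, A_MIN, A_MAX))            # an a estimate pushed back to 1.25
```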

Putting estimates on the same metric. For no particular reason, other than convenience,
the base metric was defined with respect to the December 1974 administration of E3.
Two procedures were used to place estimates on the E3 metric. One procedure sets the
scale of the items being calibrated by fixing the b estimates of the items previously
calibrated. Obviously this requires that the previously calibrated items already be
on the desired metric and that they be administered together with the items being
calibrated.

The above procedure can be used when previously calibrated items are administered
together with uncalibrated items. The second procedure puts item statistics on the
desired scale by applying a linear transformation to the item statistics. The
procedure requires that the uncalibrated items, or new form, be administered by themselves
to a random sample of population X, and the previously calibrated items, or old form,
also be administered to a random sample of population X. We then calibrate the items
separately for the new form and for the old form. The new form is put onto the scale
of the old form in the new administration by setting the means and standard deviations
of the abilities equal. We now have two separate estimates of the b parameters for
the old form, one from the new administration and one from a previous administration.
If the model holds, these estimates are linearly related. A variety of procedures can
now be used to derive the linear relationship to transform the b's for the new
administration, old form, onto the scale of the previous administration. For example, the
means and standard deviations of the two sets of b estimates can be equated.
However, in this report a robust procedure was used. This procedure, adapted by Lord,
is explained in Appendix A. Once the transformation is derived it is applied to the
a and b parameter estimates of the new form.
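The simpler alternative mentioned above, equating the means and standard deviations of the two sets of b estimates, can be sketched as follows. This is not the robust procedure of Appendix A that the report actually used; the function names and the numeric values are illustrative assumptions. The sketch relies only on the standard result that if ability is rescaled as theta* = A theta + B, then b* = A b + B and a* = a / A, with c unchanged.

```python
from statistics import mean, pstdev


def mean_sigma_transform(b_new_admin, b_prev_admin):
    """Slope A and intercept B mapping the new-administration b's of the old
    form onto the scale of the previous administration (means and standard
    deviations of the two sets of b estimates are set equal)."""
    A = pstdev(b_prev_admin) / pstdev(b_new_admin)
    B = mean(b_prev_admin) - A * mean(b_new_admin)
    return A, B


def rescale_item(a, b, A, B):
    """Apply the linear transformation to one item's a and b estimates."""
    return a / A, A * b + B


# Hypothetical b estimates for the same old-form items from two calibrations.
b_new_admin = [-1.0, 0.0, 1.0]
b_prev_admin = [-1.5, 0.5, 2.5]

A, B = mean_sigma_transform(b_new_admin, b_prev_admin)
print(A, B)

# The same A and B are then applied to every item of the new form.
print(rescale_item(1.0, 0.0, A, B))
```

A robust procedure differs mainly in how A and B are derived, for example by giving less weight to items whose b estimates are poorly determined; the way the transformation is applied is the same.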

Step-by-Step Description

In what follows we will describe each step of the calibration procedure. The following
notation will be adopted: TSWE forms are designated by the letter E and a number. To
distinguish data from the same form administered to different samples the sample code
will precede the TSWE form designation. For example, the parameter estimates designated
as W506E3 are obtained on sample W506 responding to the E3 form. The first two
characters denote the administration date. It is important to note that samples with the
same first two characters are random samples from a given administration. A "P" after
a sample designation indicates this set of items consists of pretest items. A "T" at
the end of the sample-form designation indicates the parameters have been transformed
to the metric defined by W506E3. With this background we now detail the steps of the
calibration.

(1) Estimate a, b, c for E3 with the constraints indicated earlier. E3 is the
base form and W506 is the base sample.

(2) Estimate a, b, c for E3 on the X101 sample.

(3) Derive the transformation X101E3<-->W506E3 using the procedure described in
Appendix A.

(4) Estimate a, b, c for E4 based on 1,500 testees from sample X104 and 1,500
from sample X105. These parameters are labeled X104,5E4.

(5) Apply the transformation X101E3<-->W506E3 to the X104,5E4 estimates. The
transformed parameters are labeled X104,5E4T.

(6) Estimate a, b, c for pretest items X104P by fixing the b's of the E4 items
after transformation, that is, taking the b's from X104,5E4T. This puts the a, b, c
estimates for X104P on the W506E3 metric.

(7) Same as Step 6 but for pretest items X105P.

(8) Estimate a, b, c for E5 items on sample X106. The estimates are labeled
X106E5.

(9) Apply the X101E3<-->W506E3 transformation to the X106E5 estimates. The
transformed estimates are labeled X106E5T. (Note that this is legitimate because samples
X106 and X101, on which the transformation was derived, are randomly drawn from the
same population.)

(10) Estimate a, b, c for pretest items X106P by fixing the b's of E5 to the
values in X106E5T.

(11) Estimate a, b, c for E5 on sample X401. Estimates are labeled X401E5.

(12) Derive transformation X401E5<-->X106E5T.

(13) Estimate a, b, c for E7 items on sample X406. Estimates are labeled X406E7.

(14) Apply transformation X401E5<-->X106E5T to the X406E7 estimates. Transformed
estimates are labeled X406E7T. Again, this is legitimate since X401 and X406 are
randomly drawn from the same population.

(15) Estimate a, b, c for E5 items on sample X501. Estimates are labeled X501E5.

(16) Derive transformation X501E5<-->X106E5T.

(17) Estimate a, b, c for E8 items on sample X506. Estimates are labeled X506E8.

(18) Apply transformation X501E5<-->X106E5T to the X506E8 estimates. Estimates are
labeled X506E8T. (See note in step 14.)

(19) Fix the b's of 20 pretested items to the estimates from X104P, X105P, and
X106P. Estimate the a, b, and c of the remaining items based on sample Z101; also
reestimate the a and c of the 20 pretested items. The parameters are labeled Z101E7.
[Figure 1 diagram: sample codes (rows) by TSWE forms (columns E3, E4, X104P, X105P,
X106P, E5, E7, E8). Sample codes and administration dates: W506 (12/74); X101, X104,
X105, X106 (2/75); X401, X406 (11/75); X501, X506 (12/75); Z101 (1/77); Z201 (3/77).]


FIGURE 1. Data used in the calibration. The number within a square indicates the
number of items for which a, b, and c were estimated. The number within
the circle indicates the number of items for which b was fixed and a and
c were estimated. A double-headed arrow means the development of a
transformation. The single-headed arrow means the application of a
transformation. A connecting line without arrows is used to indicate
items were administered to the same sample. The forms X104P, X105P,
and X106P were pretest forms.

(20) Fix the b's of 14 pretested items to the estimates from X104P, X105P, and
X106P. Estimate the a, b, c of the remaining items based on sample Z201. Also
reestimate the a and c of the 14 pretested items. The parameters are labeled Z201E8.

Figure 1 will be helpful in visualizing the calibration procedure. It is also
a useful representation of the relationship among the various data sets used in this
report. (It should be pointed out that this complex procedure was required to
simulate pre-equating conditions.)

Several sets of parameter estimates resulted from the calibration effort. Two
additional sets were formed and labeled E7P and E8P. E7P and E8P contain 20 and 14
items, respectively, with the a, b, and c taken from X104P, X105P, and X106P. The
remaining 30 and 36 a's, b's, and c's were taken from Z101E7 and Z201E8, respectively.
FIT OF THE THREE-PARAMETER MODEL TO THE TSWE DATA



Conceptually, it seems useful to distinguish between within and between population lack
of fit. Within population lack of fit can arise as a result of the violation of the
local independence or unidimensionality assumptions. For example, responses to certain
items in the test may be mediated by a different configuration of cognitive processes.
Between population lack of fit, on the other hand, occurs when different populations
respond to the same items with a different configuration of cognitive processes. This
may occur, for example, if the demographic composition of the population is different.

These two components of lack of fit will be examined by applying various fit
criteria to the data. Specifically, the following procedures were used:



o Examining the estimated item-on-ability regression.

o Contrasting observed and expected distributions of number-right scores.

o Examining the factorial structure of the TSWE.



Evaluation of Fit at the Item Level 



An intuitively appealing way to examine fit is by comparing the observed item-on-ability
regression against the estimated item-on-ability regression predicted by the model, i.e.,
the item characteristic curve. The comparison permits a visual assessment of how well
the estimated parameters portray the response data for a given item. For the present
application plots were constructed as follows, for each item. The estimated item
characteristic curve is given by



$$ P_i(\theta) = c_i + \frac{1 - c_i}{1 + \exp[-1.7 a_i (\theta - b_i)]} $$

where $a_i$, $b_i$, and $c_i$ are the estimated item parameters for item i, and
$P_i(\theta)$ is the probability of answering the item correctly for someone of
ability $\theta$.

The observed item-on-ability regression was computed by dividing the $\theta$ scale into
intervals of .4 and grouping students into those intervals based on their estimated
$\theta$. Within the kth interval the probability of a correct response was computed as

$$ p_{ik} = \frac{NRC_k + O_k / A}{N_k} $$

where $NRC_k$ is the number of testees who answered the item correctly in the kth
interval; $O_k$ is the number of students who omitted the item in the kth interval; A is
the number of alternatives in the item; and $N_k$ is the total number of testees in the
kth interval.
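The two quantities being compared in these plots can be sketched in a few lines. This is an illustrative sketch, not the plotting program used in the study. It assumes that the observed proportion in an interval credits each omit with 1/A of a correct response, as the definitions of the terms suggest, that items have five alternatives (as in the sample items), and that intervals of width .4 start at zero; the function names and the data are hypothetical.

```python
import math


def icc(theta, a, b, c):
    """Three-parameter logistic item characteristic curve."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))


def observed_regression(thetas, responses, n_alternatives=5, width=0.4):
    """Observed proportion correct per ability interval.

    responses uses 1 = correct, 0 = incorrect, None = omitted.
    Returns {interval index k: (correct + omits / A) / N}.
    """
    bins = {}
    for theta, u in zip(thetas, responses):
        k = math.floor(theta / width)          # interval containing theta
        n_correct, n_omit, n_total = bins.get(k, (0, 0, 0))
        bins[k] = (n_correct + (u == 1), n_omit + (u is None), n_total + 1)
    return {
        k: (n_correct + n_omit / n_alternatives) / n_total
        for k, (n_correct, n_omit, n_total) in sorted(bins.items())
    }


# Hypothetical ability estimates and responses to one item.
thetas = [0.1, 0.2, 0.3, 0.5]
responses = [1, 0, None, 1]

print(observed_regression(thetas, responses))
print(icc(0.0, 1.0, 0.0, 0.2))   # at theta = b the curve is halfway between c and 1
```

Plotting the interval proportions against the curve evaluated at each interval midpoint gives the kind of display described in the text.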
For our purposes it is of most interest to examine the plots for items where the
b parameter had been fixed to its pretested value. Figures 2 and 3 show the plots
for the 20 Z101E7 and 14 Z201E8 items for which the b was fixed. The squares constitute
the item-on-ability regression; the solid curve is the estimated icc.

The size of the square is proportional to the number of testees in that interval
of $\theta$. The asterisk next to the b value indicates the b parameter was fixed to that
value. Some items also show an asterisk next to the estimated c. This means that c
was fixed to c, the constant derived earlier (see Parameter estimation). For E7, as can
be seen, for most of the items the data fit the estimated icc rather well despite the
fact that the b parameter was estimated from a pretest administration. There are some
exceptions, however, including items 6, 23, 26, 42, and 46. For E8 most of the items fit
the estimated icc, with the exception of item 34.
[Figure 2 panels, TSWE.Z101E7: items 5, 6, 7, 14, 16, 17, 19, 23, 26, 28, 34, 39, 40,
42, 46, and 49. Each panel header reports A, B, C, and R-BIS for the item, with an
asterisk marking fixed values; the horizontal axis is ABILITY.]

FIGURE 2. Item-on-ability regression for E7 items with fixed b parameters.

[Figure 3 panels, TSWE.Z201E8: items 3, 4, 8, 13, 16, 24, 25, 30, 31, 34, 35, 38, 39,
and 43. Each panel header reports A, B, C, and R-BIS for the item, with an asterisk
marking fixed values; the horizontal axis is ABILITY.]

FIGURE 3. Item-on-ability regression for E8 items with fixed b parameters.

Evaluation of Fit at the Total Score Level

A logical extension of the previous procedure is to consider how well the model
predicts the distribution of number-right scores in a given sample. Since the
three-parameter model has no way of predicting omits for a given individual, the analysis
is based on number-right scores rather than formula scores (see footnote 1). The rationale
of this procedure is to compare a prediction of the model (in this case the frequency
distribution of number-right scores) against the empirical results (in this case the
observed distribution of number-right scores). Although a number of indices could be
used to quantify the discrepancies between the predicted and observed distributions,
none was used since it would have required additional programming. Therefore, the
assessment of fit will also be judgmental.

The observed frequency distribution of number-right scores was obtained by simply
tabulating the number of testees at each number-right score level. The predicted
frequency distribution was obtained by a complex algorithm; however, its conceptual
equivalent is easily understood as follows:

For each testee determine n*, the number of items reached.

Compute $P_i(\theta_a)$ and $Q_i(\theta_a)$, where $P_i(\theta_a)$ is the probability of
answering the ith item correctly, as given by the three-parameter logistic model, for a
given $\theta_a$; $Q_i(\theta_a) = 1 - P_i(\theta_a)$.

Generate all possible $2^{n^*}$ response vectors such that $u_i = 1$ indicates a correct
response, and $u_i = 0$ indicates an incorrect response.

For each vector substitute $P_i(\theta_a)$ if $u_i = 1$ and $Q_i(\theta_a)$ if
$u_i = 0$; multiply the probabilities to obtain the probability of the response vector.
That is, compute:

$$ \prod_{i=1}^{n^*} P_i(\theta_a)^{u_i} \, Q_i(\theta_a)^{(1 - u_i)} $$

Group response vectors with the same number of 1's. There are n* + 1 such groups
corresponding to number-right scores of 0, 1, 2, ..., n*.

Sum the probabilities of each response vector within a group. The sum of these
probabilities is the expected frequency of this number-right score. When this is done
for each group we have the expected distribution of number-right scores for one
testee.

Repeat the above steps for each testee and sum the distribution over examinees
for each number-right score.

Divide by N, the number of testees, which yields the expected distribution of
number-right scores for the entire sample.

Notice that this procedure assumes local independence since we take the product
of probabilities in the fourth step, which is the reason why the comparison against
the observed distribution of number-right scores may be viewed as a test of fit.
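The conceptual steps above can be sketched directly, together with a cheaper equivalent. The report does not describe its "complex algorithm", so the item-by-item recursion below is only a standard shortcut that gives the same distribution under local independence, not necessarily what was used. The sketch also assumes every testee reaches every item (n* equals the test length); the function names, item parameters, and abilities are hypothetical.

```python
import math
from itertools import product


def icc(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))


def score_dist_bruteforce(p):
    """Enumerate all 2**n response vectors, as in the conceptual description."""
    dist = [0.0] * (len(p) + 1)
    for u in product((0, 1), repeat=len(p)):
        prob = 1.0
        for u_i, p_i in zip(u, p):
            prob *= p_i if u_i else 1.0 - p_i   # P if correct, Q if incorrect
        dist[sum(u)] += prob                    # group by number right
    return dist


def score_dist_recursive(p):
    """Same distribution, built by adding one item at a time."""
    dist = [1.0]                                # zero items: score 0 for sure
    for p_i in p:
        new = [0.0] * (len(dist) + 1)
        for r, prob in enumerate(dist):
            new[r] += prob * (1.0 - p_i)        # item answered incorrectly
            new[r + 1] += prob * p_i            # item answered correctly
        dist = new
    return dist


def expected_distribution(thetas, items):
    """Average the per-testee distributions over the sample."""
    total = [0.0] * (len(items) + 1)
    for theta in thetas:
        p = [icc(theta, a, b, c) for a, b, c in items]
        for r, prob in enumerate(score_dist_recursive(p)):
            total[r] += prob
    return [t / len(thetas) for t in total]


items = [(0.8, -1.0, 0.15), (1.0, 0.0, 0.15), (1.2, 0.5, 0.20)]  # (a, b, c)
thetas = [-1.0, 0.0, 1.5]
print(expected_distribution(thetas, items))
```

Enumeration is only practical for a handful of items (a 50-item test has 2^50 response vectors), which is why some shortcut of this kind is needed in practice.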



1. The TSWE is scored operationally using scores corrected for guessing. We refer 
to such scores as formula scores. 
TABLE 2. Means and Standard Deviations for Observed and Expected Number-Right
Distributions Including and Excluding Students with Omitted Items

                        Including Omits        Excluding Omits
Form/Sample             Mean       S.D.        Mean       S.D.

X406E7      Obs.       33.94       8.68       34.77       8.45
            Exp.       34.03       8.95       34.72       8.82
            N           2960       2960        2512       2512

Z101E7      Obs.       31.26       8.69       32.17       9.49
            Exp.       31.34       9.93       32.14       9.84
            N           2973       2973        2461       2461

X506E8      Obs.       32.86       8.43       33.63       8.19
            Exp.       32.96       8.78       33.61       8.?3
            N           2980       2980        2514       2514

Z201E8      Obs.       33.42       7.63       34.20       7.39
            Exp.       33.58       7.94       34.18       7.81
            N           2980       2980        2417       2417


The procedure was applied to the following data sets: X406E7, Z101E7, X506E8, and
Z201E8, once excluding testees with omits and once including those students. The
results are shown in Figure 4, but are best summarized by Table 2, which reports the mean
and standard deviation of the observed and expected distributions. With omits included
the expected mean and the expected standard deviation are somewhat larger than the
corresponding observed values. With omits excluded the expected standard deviation
is also larger but the expected mean now is slightly smaller than the observed value.
Apart from these differences, however, the discrepancies between observed and expected
means and standard deviations do not appear any larger for Z101E7 and Z201E8, where
some of the b's had been fixed.

Factorial Structure of TSWE Data

The third method of assessing fit involves factor analysis. Attempts to examine fit
through factor analysis (e.g., Indow and Samejima, 1966) have been done on inter-item
correlation matrices. By contrast, the present use of factor analysis involves
correlations among subscores. Since the TSWE contains two item types, a reasonable
hypothesis is that response to each item type requires somewhat different processes;
that is, the two item types do not measure the same construct.

Method. Two formula scores were computed for each item type by totaling across odd
and even items separately. To ensure that the odd and the even scores were based on
the same number of items, item 25 was excluded from the odd items for the usage items
and item 40 was excluded from the even items for the sentence correction items.

Correlation and covariance matrices were computed based on the four scores for
the following data sets: W506E3, X104E4, X106E5, X406E7, X506E8. The matrices appear
in Appendix B. A two-factor model was fitted to each of these correlation matrices
FIGURE 4. Observed and expected distribution of number-right score
including (left) and excluding (right) students with
omitted responses. [Plots not reproduced; horizontal axis: NUMBER-RIGHT SCORE.]

TABLE 3. Summary Results of Factor Analysis with Two Item-Type Factors

                                              Correlation Between
Sample    Form    Chi-square    df     p         True Scores*

W506      E3          .42        1    .52        .889 (.01)
X104      E4         2.77        1    .10        .884 (.01)
X106      E5         3.50        1    .06        .8?1 (.01)
X406      E7          .15        1    .70        .879 (.01)
X506      E8        18.04        1    .00        .915 (.01)

*The value in parentheses is the asymptotic standard error of the correlation.



and the parameters estimated by the maximum likelihood procedure using the COFAMM
program (Sorbom and Joreskog, 1976). The model tests the hypothesis that the correlation
matrix, $\Sigma$, is described as follows:



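$$
\Sigma =
\begin{bmatrix} x_1 & 0 \\ x_2 & 0 \\ 0 & x_3 \\ 0 & x_4 \end{bmatrix}
\begin{bmatrix} 1 & x_5 \\ x_5 & 1 \end{bmatrix}
\begin{bmatrix} x_1 & x_2 & 0 & 0 \\ 0 & 0 & x_3 & x_4 \end{bmatrix}
+
\begin{bmatrix} x_6 & 0 & 0 & 0 \\ 0 & x_7 & 0 & 0 \\ 0 & 0 & x_8 & 0 \\ 0 & 0 & 0 & x_9 \end{bmatrix}
$$
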

The x's are parameters to be estimated. The 0's indicate the corresponding parameter
is fixed to zero. $x_1$ through $x_4$ represent the factor loadings; $x_5$ is the
correlation between the two factors, each defined by an item type; and $x_6$ through
$x_9$ represent the unique variance of each subscore. This model is discussed by
Joreskog (1978), who notes it is a restatement of an earlier model (1956) by Lord.
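To make the role of each parameter concrete, the following sketch builds the matrix implied by a two-factor model of this kind (loadings, a factor correlation, and unique variances) for four subscores. It assumes the subscores are ordered as the two usage half-scores followed by the two sentence correction half-scores; the function name and all numeric values are hypothetical and are not estimates from the study.

```python
def implied_matrix(x):
    """Model-implied 4 x 4 matrix for parameters x = (x1, ..., x9).

    x1..x4 are the loadings, x5 is the correlation between the two
    item-type factors, and x6..x9 are the unique variances.
    """
    x1, x2, x3, x4, x5, x6, x7, x8, x9 = x
    loadings = [(x1, 0.0), (x2, 0.0), (0.0, x3), (0.0, x4)]
    factor_corr = [(1.0, x5), (x5, 1.0)]
    unique = (x6, x7, x8, x9)

    sigma = [[0.0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            common = sum(
                loadings[i][f] * factor_corr[f][g] * loadings[j][g]
                for f in range(2)
                for g in range(2)
            )
            sigma[i][j] = common + (unique[i] if i == j else 0.0)
    return sigma


# Hypothetical values: loadings .9 and .8, factor correlation .88.
params = (0.9, 0.9, 0.8, 0.8, 0.88, 0.19, 0.19, 0.36, 0.36)
for row in implied_matrix(params):
    print([round(v, 4) for v in row])
```

With the factor correlation set to 1.0 the cross-type entries are simply products of loadings, which is the one-construct hypothesis; a value below 1.0 lowers only the correlations between subscores of different item types.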

For our purposes the model can be used to test the presence of an item type
effect by estimating the model with $x_5$ set to 1.0, that is, hypothesizing the
correlation among true scores to be perfect. This was done for the five data sets
mentioned earlier and, in every case, the hypothesis $x_5 = 1$ was rejected with
p < .0001. The model was then estimated allowing $x_5$ to be estimated. The results are
shown in Table 3. For all forms but E8 the model now fits (p > .05). Even though the p
values may not be accurate, since the data do not have a multivariate normal distribution
as assumed by COFAMM, the magnitude of the chi-square statistic suggests that for E8 the
two-factor model did not account as well for all the correlations.

The estimated correlation between the true scores for the two item types is also
shown in Table 3. The asymptotic standard error of the estimated correlation is shown
in parentheses. As can be seen the correlations are below .90, except for E8.
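The p values in Table 3 follow from the chi-square statistics with one degree of freedom, and the relationship is easy to check. The sketch below uses the identity that, for one degree of freedom, the upper-tail probability of a chi-square value x equals erfc(sqrt(x / 2)); the function name is illustrative, and the chi-square values are the ones reported in Table 3.

```python
import math


def chi2_p_value_df1(x):
    """Upper-tail probability of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


# Chi-square values reported in Table 3, by form.
table3 = {"E3": 0.42, "E4": 2.77, "E5": 3.50, "E7": 0.15, "E8": 18.04}

for form, chi_square in table3.items():
    p = chi2_p_value_df1(chi_square)
    verdict = "fits" if p > 0.05 else "rejected"
    print(f"{form}: chi-square = {chi_square:5.2f}, p = {p:.4f} ({verdict} at .05)")
```

Only E8 falls below the .05 level, in line with the statement that the two-factor model fits all forms but E8.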

This analysis suggests that the structure of the TSWE can be understood by
postulating two item-type factors. Although the correlation between the two scores is very
high, it does not approach 1.0 as would be expected if both scores measured a single
construct. More concretely, removing the constraint that the two factors correlate
1.0 reduced the residuals to almost zero. To illustrate, the residual covariance
matrices for W506E3 under the two models are shown below. The residuals under the
hypothesis $x_5 = 1$ are shown above the diagonal. The corresponding residuals when
$x_5$ is estimated are shown below the diagonal. (The first letter, E or O, stands for
even and odd, respectively; US = usage, SC = sentence correction.)

          OUS      EUS      OSC      ESC

OUS        -      .012    -.016    -.014

EUS      .000       -     -.013    -.019

OSC     -.002     .002       -      .083

ESC      .002    -.002     .000       -

It can be seen that all residuals are substantially reduced when x5, the correlation
between the two factors, is estimated rather than fixed to a value of 1.00. In short,
the analysis presented here suggests that the internal structure of the TSWE can be
understood better in terms of two item-type factors.

Summary

It was shown in this section that there appears to be some lack of fit of the three-parameter
model to the TSWE data. This was obvious at the item level, where some
observed and expected item-on-ability regressions were not congruent, and at the subscore
level, where it appeared that two item-type factors were necessary to fully account
for the internal structure of the data, hence suggesting lack of unidimensionality.
Also, at the total score level, the model did not seem to reproduce the distribution
of number-right scores completely accurately. The important question from a
practical point of view is whether these deviations from the model have an impact
on equating results. The next section presents the results relevant to that question.

ASSESSMENT OF PRE-EQUATING

The criterion for judging the adequacy of pre-equating in the present study is comparison
to conventional equating. Implicit in this choice of criterion is the assumption that
conventional equating provides a reasonable criterion. While this is not generally true,
the conventional equating was done by spiraling the old and new forms in random samples
of the new population. Furthermore, the test specifications are observed very strictly
and, as a result, there is minimal variation across forms of the test. All of this
suggests conventional linear equating should work adequately with the TSWE. Nevertheless,
three conventional equatings were used as criteria. One criterion, labeled C1,
is the operational linear equating, that is, the procedure used in the reporting of
scores. For E7 and E8 the operational equatings used SAT-V and SAT-M as anchor tests.
The second criterion, C2, was similar to C1 except that only SAT-V was used as an anchor.
Finally, the third criterion, C3, was equi-percentile equating using SAT-V as an anchor
test. (It would have been desirable to use SAT-V and SAT-M as anchors for equi-percentile
equating also, but the computer program did not allow it.) A description of linear and
equi-percentile equating can be found in Petersen, Cook, and Stocking (1981).

The comparison of pre-equating and operational equating results will be limited to
two forms, E7 and E8, since only on these forms was it possible to simulate pre-equating
conditions. For operational use, that is, to actually report converted scores to
testees, E7 had been equated to E5 in the November 1975 administration; E8 had
also been equated to E5, but in the December 1975 administration. In both cases, the
"old" and "new" forms were spiraled, and scores on the SAT-V and SAT-M were used to
adjust the TSWE scores before equating the TSWE means and standard deviations of the
two forms. The results of conventional linear equating are two parameters, usually
referred to as A and B, which are used for converting formula scores to the TSWE
metric as follows:

CS = A (FS) + B 

where CS is the converted score and FS is the formula score. Since the TSWE has 50
items, FS ranges from -12 to 50. If CS is less than 20 it is set to 20. Also, if CS
is greater than 60 it is set to 60. The result of equi-percentile equating is
a table which converts scores on the old form to corresponding scores on the new form.
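The linear conversion and its truncation to the 20 to 60 reporting range can be sketched as follows. This is only an illustration: the function name and the values of A and B are hypothetical, and no rounding rule is applied because the text does not state one.

```python
def converted_score(fs, A, B, low=20, high=60):
    """Linear conversion CS = A*FS + B, truncated to the reported TSWE range."""
    cs = A * fs + B
    # Scores below 20 are set to 20; scores above 60 are set to 60.
    return max(low, min(high, cs))


# Hypothetical conversion parameters, chosen only to exercise both truncation limits.
A, B = 1.0, 13.0
for fs in (-12, 10, 30, 50):
    print(fs, converted_score(fs, A, B))
```

With these made-up parameters the lowest and highest formula scores fall outside the reporting range and are truncated, while intermediate scores pass through the linear rule unchanged.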

For methodological as well as practical reasons two levels of pre-equating were
studied. The least demanding level consists of estimating IRT parameters for the new
form, E7 or E8 in this case, when the items appear together as a form. Strictly
speaking, this is not pre-equating but merely IRT-based true-score equating. We
will refer to it as IRT-equating. (If the IRT parameters had been estimated on a
different population it could be considered truly pre-equating.) Nevertheless,
precisely because it is a very undemanding form of pre-equating, the results from
this comparison serve as a good benchmark against which to compare the results of
pre-equating proper. The parameter sets for the new forms for this comparison were
X406E7T and X506E8T for E7 and E8, respectively.

For the second level, pre-equating proper, the parameter sets E7P and E8P were
used. For E7 and E8 the a's, b's, and c's of 2CC and 14 items, respectively, were taken
from parameter sets X104P, XIO^P, and X106P. For the remaining items the a's, b's, and
c's were taken from the parameter set Z100E7 for E7, and the parameter set Z201E8
for E8.

Within IRT-equating and pre-equating, three old forms were used, namely E3, E4,
and E5, to put the new form E7 or E8 on the TSWE scale.

Equating Based on IRT

The procedure used to transform formula scores on the new form to scaled scores can be
described in general as follows: For a given true score on the new form, find the
corresponding θ. Next, find the true score on the "old" form associated with this θ.
Finally, apply existing conversion parameters to put the equated true scores on the
TSWE scale. Since the TSWE is scored using formula scores, the actual procedure is
based on true formula scores. A step-by-step description follows.



For each integer formula score on the new form, FS_{new}, greater than

    \sum_{i=1}^{50} c_{i,new}

and less than or equal to 50 (where c_{i,new} is the estimated c parameter for the ith
item), compute the associated true score on the number-right scale as follows:

    NR_{new} = .80 FS_{new} + 50/5

This is based on the fact that if an examinee attempts all items in the test, number-right
and formula score are linearly related with slope m/(m-1), where m is the number
of alternatives, and constant n/(m-1), where n is the number of items in the test. A
similar relationship holds for true formula scores and true scores, as shown by Lord
(1980, Chapter 15).
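The linear relation can be written out as a short derivation. It assumes the usual rights-minus-a-fraction-of-wrongs formula score and an examinee who attempts every item, so that the number wrong is n - NR; the last line substitutes m = 5 and n = 50 for the TSWE.

```latex
\begin{align*}
FS &= NR - \frac{n - NR}{m - 1} = \frac{m}{m-1}\,NR - \frac{n}{m-1}, \\
NR &= \frac{m-1}{m}\,FS + \frac{n}{m}, \\
NR &= .80\,FS + \frac{50}{5} \qquad (m = 5,\ n = 50).
\end{align*}
```

Inverting the last line gives FS = NR/.80 - 50/4, which is the form used when converting an old-form true score back to a true formula score.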

The next step is to find the θ associated with a given NR_{new}. This is done by
solving for θ in the equation:

    NR_{new} = \sum_{i=1}^{50} P_i(\theta),

where P_i(θ) is computed using the a, b, and c for the new form.

Having found the needed θ, compute the corresponding true score on the old form as
follows:

    NR_{old} = \sum_{i=1}^{50} P_i(\theta),

where now P_i(θ) is computed using the a, b, and c from the old form.



The true formula score corresponding to this true score is

    FS_{old} = NR_{old}/.80 - 50/4

Finally, FS_{old} is converted by means of existing parameters A and B as follows:

    CS = A(FS_{old}) + B

If FS_{old} is less than I, a somewhat different procedure is used.

The procedure is described in Appendix C of Chapter 13 of Lord (1980). This
procedure was applied with E3, E4, and E5 as old forms (using parameter sets W506E3,
X104,5E4T, and X106E5T, respectively) and new forms E7 and E8 (using parameter sets
X406E7T and X506E8T for IRT-equating, and parameter sets E7P and E8P for pre-equating).
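The step-by-step procedure above can be sketched in code. This is a minimal illustration under stated assumptions, not the program used in the study: the scaling constant D = 1.7, the bisection search, the function names, and the item parameters in the example are all hypothetical, and the special handling of scores at or below the chance level is not implemented.

```python
import math

D = 1.7  # logistic scaling constant (an assumption; the text does not state it)


def p3pl(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def true_score(theta, items):
    """True number-right score: the sum of P_i(theta) over the items."""
    return sum(p3pl(theta, a, b, c) for a, b, c in items)


def theta_for_true_score(nr, items, lo=-10.0, hi=10.0, tol=1e-8):
    """Solve NR = sum P_i(theta) by bisection; the sum is increasing in theta."""
    if not (true_score(lo, items) < nr < true_score(hi, items)):
        raise ValueError("true score is outside the range attainable under the model")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if true_score(mid, items) < nr:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def equate_formula_score(fs_new, new_items, old_items, n_choices=5):
    """Map a true formula score on the new form to the old-form formula-score scale."""
    k = (n_choices - 1) / n_choices                     # .80 for five-choice items
    nr_new = k * fs_new + len(new_items) / n_choices    # NR_new = .80 FS_new + 50/5
    theta = theta_for_true_score(nr_new, new_items)
    nr_old = true_score(theta, old_items)
    return nr_old / k - len(old_items) / (n_choices - 1)  # FS_old = NR_old/.80 - 50/4


# Hypothetical 50-item forms: the "old" form is uniformly harder by 0.5 on the b scale.
new_form = [(1.0, -2.0 + 4.0 * i / 49, 0.2) for i in range(50)]
old_form = [(a, b + 0.5, c) for a, b, c in new_form]
print(equate_formula_score(25, new_form, old_form))
```

The resulting old-form formula score would then be passed through the existing linear conversion CS = A(FS) + B. Equating a form to itself returns the input score, which is a convenient sanity check on the root finder.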

Results

The results of the three criterion equatings were all very close. This can be seen
in Tables 4 and 5, which report the criterion equatings corresponding to every five
formula raw score points. (Full point-by-point conversions are found in Appendix C.)
The largest discrepancy among criteria is 1 point for both E7 and E8. The discrepancy
of IRT-equating and pre-equating with the criterion equatings, however,
is larger.



The magnitude of the discrepancies can be appreciated by examining the mean and
standard deviation corresponding to the criterion equatings, IRT-equating, and pre-equating.
The means and standard deviations are based on the frequency distribution
observed on the first national administration of E7 and E8. (These frequencies can
be found in Appendix C.) In particular, four trends are more or less obvious. First,
operational and IRT-based equatings are much more discrepant for E8 than for E7.
Second, for the IRT-based conversions the mean is higher and the standard deviation
smaller compared to the criterion equatings. Third, the choice of an old form seems
to affect the discrepancy between IRT-based and operational equating. More concretely,
using E3 as an old form for either pre-equating or IRT-equating yields the most discrepant
results. Finally, comparing the results for pre-equating and IRT-equating,
pre-equating for E7 actually yields less discrepant results than IRT-equating, but the
opposite is true for E8.

A more detailed analysis of the results can be obtained from an index suggested
by Marco, Petersen, and Stewart (1979), namely, the weighted mean squared error,

    \sum_j f_j d_j^2 / N = \sum_j f_j (d_j - \bar{d})^2 / N + \bar{d}^2

or

    (total error) = (variance of differences) + (squared bias)



TABLE 4. Summary Conversion Table Comparing Conventional Equating, IRT-equating, and
Pre-equating for E7

                  Criterion            IRT-equating          Pre-equating
                                         Old Form              Old Form
Raw Score     C1     C2     C3     E3     E4     E5     E3     E4     E5

   50         60     60     60     60     60     60     60     60     60
   45         58     58     59     58     58     59     57     57     58
   40         53     53     53     54     54     54     53     52     52
   35         48     48     48     49     49     49     48     48     48
   30         43     43     43     45     44     44     44     44     43
   25         38     38     38     40     39     39     40     39     39
   20         33     33     33     35     34     34     35     34     34
   15         28     28     28     30     28     28     30     29     29
   10         22     22     22     24     23     23     25     24     24
    5         20     20     20     20     20     20     20     20     20
    0         20     20     20     20     20     20     20     20     20
   -5         20     20     20     20     20     20     20     20     20
  -10         20     20     20     20     20     20     20     20     20

Mean       43.84  43.83  43.69  45.25  44.55  44.78  44.66  44.06  44.?7
S.D.        9.70   9.73   9.80   9.20   9.64   9.81   8.62   8.93   9.05

C1 is based on linear observed score equating using SAT-V and SAT-M as anchors;
C2 only uses SAT-V as an anchor; C3 is based on equi-percentile equating. The
means and standard deviations are based on the formula score frequency distribution
for the first national administration of E7.



TABLE 5. Summary Conversion Table Comparing Conventional Equating, IRT-equating,
and Pre-equating for E8

                  Criterion            IRT-equating          Pre-equating
                                         Old Form              Old Form
Raw Score     C1     C2     C3     E3     E4     E5     E3     E4     E5

   50         60     60     60     60     60     60     60     60     60
   45         59     59     60     58     59     60     58     58     59
   40         53     53     54     53     53     53     54     53     54
   35         47     48     48     49     48     48     49     49     49
   30         42     42     42     43     43     42     44     43     43
   25         36     36     35     38     37     37     39     38     38
   20         30     30     29     33     31     32     34     33     33
   15         24     24     24     27     26     26     29     27
   10         20     20     20     22     21     21     23     22     22
    5         20     20     20     20     20     20     20     20     20
    0         20     20     20     20     20     20     20     20     20
   -5         20     20     20     20     20     20     20     20     20
  -10         20     20     20     20     20     20     20     20     20

Mean       42.10  42.14  42.04  43.59  42.84  42.94  44.30  43.60  43.59
S.D.        9.96   9.98  10.21   8.77   9.26   9.31   8.34   8.78   8.77

See footnote to Table 4. The means and standard deviations are based on the
formula score frequency distribution for the first national administration of E8.



where d_j = (t*_j - t_j); t*_j is the criterion score (which in this case corresponds to the
operational score) for raw score x_j; t_j is the IRT-based converted score corresponding
to the same raw score x_j; \bar{d} = \sum_j f_j d_j / N; f_j is the frequency of x_j; and N = \sum_j f_j.
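The index and its two components can be computed directly from the definitions just given. The sketch below is illustrative only: the function name is hypothetical and the three-point conversion table in the example is made up to keep the arithmetic easy to follow.

```python
def weighted_mse(criterion, irt_based, freqs):
    """Return (total error, variance of differences, squared bias).

    criterion : criterion converted scores t*_j, one per raw score x_j
    irt_based : IRT-based converted scores t_j for the same raw scores
    freqs     : frequencies f_j of each raw score
    """
    N = sum(freqs)
    d = [t_star - t for t_star, t in zip(criterion, irt_based)]
    d_bar = sum(f * dj for f, dj in zip(freqs, d)) / N
    total = sum(f * dj ** 2 for f, dj in zip(freqs, d)) / N
    variance = sum(f * (dj - d_bar) ** 2 for f, dj in zip(freqs, d)) / N
    return total, variance, d_bar ** 2


# Made-up example: three raw-score points with frequencies 1, 2, 1.
total, variance, bias_sq = weighted_mse([20, 30, 40], [21, 32, 41], [1, 2, 1])
print(total, variance, bias_sq)
```

The total error always equals the sum of the other two terms, so a table of these indices can be checked for internal consistency by adding the variance and squared-bias entries.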


Tables 6 and 7 show the computed indices for E7 and E8, respectively. The f's
used correspond to the frequency of x in the first national administration and can
be found in Appendix C.

An examination of the discrepancy indices largely corroborates the results noted
earlier. Within a given equating procedure there is variation due to the choice of
old form. For IRT-equating, E3 yields the most discrepant results in terms of the
weighted mean squared differences, E4 the least discrepant results, and E5 is in between.
This is true for both E7 and E8. For pre-equating, E3 also yields the most
discrepant results, but E5 yields the least discrepant results, with E4 in between.
Again, this is true for both E7 and E8.

For E7, C1, the linear equating using SAT-V and SAT-M as anchors, yields the
least discrepant results for both IRT-equating and pre-equating, followed by C2, linear
equating using only SAT-V as anchor, and C3, equi-percentile equating. For E8, however,
using linear equating with only SAT-V as anchor yields the least discrepant results,
followed by C1 and C3 in that order. That is, using equi-percentile equating as
a criterion yielded the most discrepant results for both E7 and E8 and for both
IRT-equating and pre-equating.

Comparing IRT-equating and pre-equating, it can be seen from Table 6 that for E7
pre-equating is actually closer to the criterion equatings but that the composition
of the mean squared error is different. For IRT-equating the squared bias is the
larger component, whereas for pre-equating the variance of the differences is actually
the larger component. For E8, however, the squared bias is the larger component for
both IRT-equating and pre-equating.



SUMMARY AND CONCLUSIONS 



This investigation was concerned with how well IRT equating and pre-equating could
reproduce the conversion line for two TSWE forms which had been previously equated
by conventional observed score equating methods. The approach was to determine first
how well TSWE data fitted the three-parameter logistic model and then to compare IRT
equating and pre-equating against three criterion equatings so that discrepancies in
equating could be traced to more fundamental questions of fit.

The various procedures for investigating fit suggested several violations of the
assumptions of the model. At the item level some of the estimated item-on-ability
regressions did not fit the data as well when the b parameter had been fixed to its
estimated value based on a pretest administration. This is important in pre-equating,
since presumably in practical application parameter estimates would be obtained from
pretests. However, the fact that the problem was observed on just a few items suggests
that the problem may not be too serious.

At the subscore level it was shown that two factors, corresponding to the two
TSWE item types, were required to account for the internal structure of the TSWE,
thus suggesting a violation of the unidimensionality assumption. This is to be expected,
and perhaps, so long as the nature of the multidimensionality is constant across
sample-form combinations, no great harm would occur. It so happened, however, that
for form E8 the two-factor model that fitted the other forms did not fit as well.
Furthermore, the equating parameters derived under conventional procedures are very
TABLE 6. Weighted Mean Square Difference for Form E7, Using E3, E4, and E5 as the "Old"
Form and Three Different Criteria

                                        IRT-equating           Pre-equating
                                          Old Form               Old Form
                        Criterion     E3     E4     E5       E3     E4     E5

Mean squared difference    C1       2.53    .71    .94     2.04    .79    .70
                           C2       2.57    .73    .96     2.11    .84    .74
                           C3       3.39   1.33   1.43     2.77   1.19    .88

Variance of difference     C1        .53    .20    .89     1.37    .74    .64
                           C2        .53    .20    .92     1.42    .79    .68
                           C3        .95    .58   1.19     1.83   1.06    .73

Squared bias               C1       2.00    .51    .05      .67    .05    .06
                           C2       2.04    .53    .04      .69    .05    .06
                           C3       2.44    .74    .24      .94    .13    .15

C1 is based on linear observed score equating using SAT-V and SAT-M as anchors;
C2 only uses SAT-V as an anchor; C3 is based on equi-percentile equating. The
weighting function is the formula score frequency distribution for the first
national administration of E7.

TABLE 7. Weighted Mean Squared Difference for Form E8, Using E3, E4, and E5 as the "Old"
Form and Three Different Criteria

                                        IRT-equating           Pre-equating
                                          Old Form               Old Form
                        Criterion     E3     E4     E5       E3     E4     E5

Mean squared difference    C1       3.99   1.22   1.34     7.67   4.00   3.81
                           C2       3.84   1.17   1.29     7.53   3.85   3.76
                           C3       4.89   1.70   1.89     8.81   4.89   4.68

Variance of difference     C1       1.75    .66    .63     2.83   1.73   1.67
                           C2       1.74    .69    .66     2.90   1.73   1.67
                           C3       2.49   1.07   1.09     3.74   2.48   2.30

Squared bias               C1       2.24    .55    .71     4.84   2.27   2.23
                           C2       2.10    .48    .63     4.64   2.12   2.09
                           C3       2.39    .63    .80     5.07   2.42   2.38

See footnote to Table 6. The weighting function is the formula score frequency
distribution of the first national administration of E8.
different for E8 compared to the other forms. This suggests that the internal
structure of E8 was somewhat different.

Finally, at the total score level, a slight bias was observed in the prediction
of the mean and standard deviation of number-right scores. The direction of the
bias, however, depended on whether students with omitted responses were excluded or
included in the data. When students with omitted responses were included in the data
the mean and standard deviation of the expected distribution were slightly larger
than the corresponding observed values. When students with omitted items were excluded
the mean of the expected distribution was slightly smaller than the observed
mean but the standard deviation was still larger than the observed value.

The bias in the mean when students with omits are included appears to be partly
due to the fact that the omitted responses are not counted at all in the observed data,
whereas the calculation of the expected distribution assumes all items are answered.
Thus if students were instructed to respond to the omitted items their number-right
score would increase. In fact, when students with omits are excluded the bias changes
direction, although it is of a very small magnitude, on the order of .02 or .03.
There is no obvious explanation for this "net" bias in the mean or the bias in the
standard deviation. It is not clear, for example, whether the bias is due to the
apparent multidimensionality of the data or an inherent bias in the LOGIST procedure.
Clearly, further research is needed.

Departures from the model are to be expected with actual data. The important
question, and the focus of this study, is whether such departures seriously affect
equating. To answer this question IRT equating and pre-equating were done for TSWE
forms E7 and E8. The results for E8 were disappointing in that large discrepancies
were found between the operational conversion line and the IRT-based equatings. However,
since the two-factor model that fitted the other forms did not fit E8, this appears
to be the result of the aberrant internal structure of E8 rather than a failure of
IRT equating. In other words, it is not clear that E8 is properly equated even by
conventional methods and, hence, for E8 the converted scores may not be a good
criterion. Therefore, it is wiser to formulate conclusions based on the results
for E7 only.

The equating results for E7 were much more favorable, but some consistent discrepancies
were observed, including an overestimation of the mean as well as an underestimation
of the standard deviation of the distribution of converted scores.
The overestimation of the converted score mean would seem to be consistent with the
fact that the mean number-right score is also overestimated when students with omits
are included in the data.

The underestimation of the standard deviation of converted scores, however, is
inconsistent with the earlier finding that the standard deviation of number-right
scores is actually overestimated by the IRT model. This suggests that something in
the transformation of formula scores to scaled scores is responsible for the misrepresentation
of the standard deviation. At least two limitations of the transformation
procedure are obvious. One is the assumption of a linear relationship
between formula scores and number-right scores, which is false if students with
omitted responses are included. A second limitation of the procedure lies in the
fact that to put the new form on scale it is necessary to apply A and B conversion
parameters based on observed score equating to true formula scores. Unfortunately,
it is not obvious what alternative procedure could be devised to put new forms on
scale so long as formula scoring is in use. This suggests that IRT-equating and
pre-equating would be more accurate if number-right scoring were used rather than
formula scoring. Number-right scoring is no panacea, however, so long as tests are
speeded (and they always will be for at least some students under the usual administration
procedures). The problem is that under number-right scoring it is
advantageous to the student to attempt an item even if he or she has to guess. If
students actually do so they will not be responding as a function of their ability
and will thus create a violation of the model (Lord, 1980).

As for pre-equating, based on E7, there is reason to be optimistic, since the
mean squared differences were not consistently higher for pre-equating across old
forms or criteria. However, the criteria for evaluating both IRT-equating and pre-equating
are not defensible on other than practical grounds. Thus, unless the comparability
of IRT-equating and pre-equating were to change when evaluated against a more
adequate criterion, we can reasonably expect that as better procedures for linking
parameters are developed (see, e.g., Petersen et al., 1981) pre-equating will prove
to be a feasible operational procedure.
APPENDIX A: Transformation of the b's to Put the LOGIST Output from a New
Administration on the Same Scale as the Output from an Old Administration

Robust estimates of scale and location are used to determine the slope and intercept
of the line relating the b's estimated from two samples of examinees.

The estimate of scale for each of the forms is a biweight estimate (see
Mosteller and Tukey, 1977). The formulas are

    \bar{b} = median of the b's,

    u_i = (b_i - \bar{b}) / (9 MAD_b),

where MAD_b = median absolute deviation = median |b_i - \bar{b}|, and

    s_b^2 = n \sum' (b_i - \bar{b})^2 (1 - u_i^2)^4 / \{ [\sum' (1 - u_i^2)(1 - 5u_i^2)] [\sum' (1 - u_i^2)(1 - 5u_i^2) - 1] \},

where n is the number of b's and \sum' indicates summation over those i for which u_i^2 < 1.

Let B_i be the b's on the new sample and b_i be the b's on the old sample.
The slope of the line relating the b's is taken to be

    m = s_B / s_b.

Define an xy coordinate system by

    x = (b + mB),
    y = (-mb + B).
Get a robust estimate of location separately for x and for y using the
formulas on page 205 of Mosteller and Tukey. Let y* = median of the y's and

    w_i = (1 - ((y_i - y*) / (c MAD_y))^2)^2    when ((y_i - y*) / (c MAD_y))^2 < 1,
    w_i = 0                                      otherwise,                         (1)

where MAD_y = median |y_i - y*| and c = 6. Compute a new y* from the formula

    y* = \sum_i w_i y_i / \sum_i w_i.                                              (2)

Iterate through equations 1 and 2 until the change between two estimates of y* is less
than .0001. Repeat the process for the x's.

Transform these biweight estimates of location in the xy coordinates back to the
bB coordinates and require that the line with slope m pass through this point:

    B* = (m x* + y*) / (m^2 + 1),
    b* = (x* - m y*) / (m^2 + 1).

The equation for the line that puts the old parameters on the new parameter scale is

    b_T = mb + B* - mb*                                   (3)

and

    a_T = (1/m) a.                                        (4)

The equation for the line that transforms the new parameters to the old parameter
scale is

    B_T = B/m - [B* - mb*]/m                              (5)

and

    A_T = mA,                                             (6)

where a and A are the discrimination estimates based on the old and new samples,
respectively.
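The robust linking line described in this appendix can be sketched as follows. This is an illustrative reading of the procedure, not the original program: the function names are hypothetical, the factor n in the scale estimate follows the standard Mosteller and Tukey biweight formula, and the MAD is recomputed at each iteration of the location estimate, which is one possible interpretation of iterating equations 1 and 2.

```python
import math
import statistics


def biweight_scale(values, k=9.0):
    """Biweight estimate of scale with tuning constant k (9 in the appendix)."""
    n = len(values)
    center = statistics.median(values)
    mad = statistics.median([abs(v - center) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; the biweight scale is undefined")
    num = 0.0
    s = 0.0
    for v in values:
        u2 = ((v - center) / (k * mad)) ** 2
        if u2 < 1.0:  # the primed sums skip points with u^2 >= 1
            num += (v - center) ** 2 * (1.0 - u2) ** 4
            s += (1.0 - u2) * (1.0 - 5.0 * u2)
    return math.sqrt(n * num / (s * (s - 1.0)))


def biweight_location(values, c=6.0, tol=1e-4, max_iter=100):
    """Iteratively reweighted biweight location, started at the median."""
    est = statistics.median(values)
    for _ in range(max_iter):
        mad = statistics.median([abs(v - est) for v in values])
        if mad == 0:
            return est
        weights = []
        for v in values:
            u2 = ((v - est) / (c * mad)) ** 2
            weights.append((1.0 - u2) ** 2 if u2 < 1.0 else 0.0)
        new_est = sum(w * v for w, v in zip(weights, values)) / sum(weights)
        if abs(new_est - est) < tol:
            return new_est
        est = new_est
    return est


def linking_line(old_b, new_B):
    """Return (m, intercept) so that b_T = m*b + intercept puts old b's on the new scale."""
    m = biweight_scale(new_B) / biweight_scale(old_b)
    x = [b + m * B for b, B in zip(old_b, new_B)]
    y = [-m * b + B for b, B in zip(old_b, new_B)]
    x_star, y_star = biweight_location(x), biweight_location(y)
    B_star = (m * x_star + y_star) / (m * m + 1.0)
    b_star = (x_star - m * y_star) / (m * m + 1.0)
    return m, B_star - m * b_star


# Made-up difficulties: the "new" values are an exact linear function of the "old" ones.
old_b = [-1.5, -0.8, -0.3, 0.0, 0.4, 0.9, 1.6, 2.1]
new_B = [2.0 * b + 1.0 for b in old_b]
m, intercept = linking_line(old_b, new_B)
print(m, intercept)
```

When the two sets of difficulties are exactly linearly related, as in this made-up example, the procedure recovers the slope and intercept of that relation. The discriminations would then be rescaled by 1/m, as in equation 4.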

To put the parameters of a new form onto the same scale as an old form, the new
form and the old form must be administered to random samples in a new administration.
This was done by spiraling. For the new administration we reestimate the parameters
for the old form and estimate the parameters for the new form on random samples of
equal size. The new form is put onto the scale of the old form in the new administration
by setting the means and standard deviations of the abilities equal. This is
done in LOGIST by standardizing the abilities to a mean of 0 and a standard deviation
of 1 for both forms. Then the transformation that puts the old form, new sample,
onto scale (equations 5 and 6) is applied to the parameters for the new form to put
those parameters onto scale.



Reference 

Mosteller, F., and J. W. Tukey. Data Analysis and Regression. Reading, Mass.:
Addison-Wesley, 1977.
APPENDIX B: Variance-Covariance, Correlation Matrices, and Mean Vectors for Five TSWE Forms

Form                 EUS        OUS        ESC        OSC      Means

W506E3    EUS    17.3214    12.4191     4.8800     4.1147     9.6094
          OUS      .7535    15.6848     4.6327     3.8515    10.4343
          ESC      .5842      .5828     4.0280     1.9298     4.3198
          OSC      .5556      .5466      .5404     3.1659     4.4243

X104E4    EUS    15.9720    12.5510     4.5698     4.5351     9.3356
          OUS      .7488    17.5905     4.918?     4.7088     9.1504
          ESC      .5665      .5810     4.0743     2.2352     4.2981
          OSC      .5639      .5579      .5503     4.0494     3.3263

X106E5    EUS    16.6073    11.5705     4.2797     4.7738     9.0585
          OUS      .7210    15.5071     4.2672     4.5542     9.5068
          ESC      .5218      .5384     4.0514     2.1700     3.5152
          OSC      .5873      .5798      .5405     3.9786     4.4588

X406E7    EUS    16.1853    12.5826     4.4694     3.8127     9.6603
          OUS      .7574    17.0511     4.8645     4.1165    11.0141
          ESC      .5999      .6361     3.4295     1.9011     4.7617
          OSC      .5236      .5507      .5671     3.2765     4.8550

X506E8    EUS    15.0591     9.8126     4.0468     4.3847     9.4296
          OUS      .6888    13.4784     3.7155     3.6014    10.3031
          ESC      .5678      .5511     3.3727     1.8783     4.6323
          OSC      .5682      .4933      .5143     3.9549     4.0061

Notation: EUS, even usage subscore; OUS, odd usage subscore; ESC, even sentence
correction subscore; and OSC, odd sentence correction subscore. The correlation
matrix appears below the diagonal.
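Because each matrix carries variances and covariances on and above the diagonal and correlations below it, any below-diagonal entry can be checked against the entries above it. The sketch below does this for the EUS and OUS subscores of W506E3, using the values printed in this appendix; the function name is hypothetical.

```python
import math


def correlation(cov_xy, var_x, var_y):
    """Correlation implied by a covariance and the two variances."""
    return cov_xy / math.sqrt(var_x * var_y)


# W506E3: var(EUS) = 17.3214, var(OUS) = 15.6848, cov(EUS, OUS) = 12.4191.
r = correlation(12.4191, 17.3214, 15.6848)
print(round(r, 4))
```

The value agrees with the tabled correlation of .7535 to four decimal places. The same check can be used to recover a table entry that is illegible in a given copy.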


















APPENDIX C

TABLE C1. Point-by-Point Conversion Table for Form E7

                  Criterion          IRT-equating        Pre-equating
                                       Old Form            Old Form
Raw Score    C1    C2    C3    E3    E4    E5    E3    E4    E5    Frequency

47 to 50     60    60    60    60    60    60    59    59    60           35
46           59    59    60    59    59    60    58    58    59          935
45           58    58    59    58    58    59    57    57    58         1091
44           57    57    57    57    57    58    56    56    57         1215
43           56    56    56    56    56    57    55    55    55         1286
42           55    55    55    55    55    56    54    54    54          127
41           54    54    54    55    54    55    53    53    53         1274
40           53    53    53    54    54    54    53    52    52         1362
39           52    52    51    53    53    53    52    51    51         1378
38           51    51    50    52    52    52    51    51    50         1402
37           50    50    50    51    51    51    50    50    50          224
36           49    49    49    50    50    50    49    49    49         1336
35           48    48    48    49    49    49    48    48    48         1390
34           47    47    46    49    48    48    48    47    47         1374
33           46    46    45    48    47    47          4?    46         1369
32           45    45    45    47    46    46    46    45    45          327
31           44          44    46    45    45    45    44    44         1254
30           43    43    43    45    44    44    44    44    43         1274
29           42    42    42    44    43    43    43    43    43         1252
28           41    41    41    43    42    42    43    42    42         1209
27           40    40    40    42    41    41    42    41    41          370
26           39    39    39    41    40    40    41    40    40         1098
25           38    38    38    40    39    39    40    39    39         10?1
24           37    37    37    39    38          39    38    38         1045
23           36    36          38    3?    37    38    37    37          967
22           35    35    35    37    3?    36    37    36    36          352
21           34    34    34    36    35    35    36    35    35          845
20           33    33    33    35    34    34    35    34    34          804
19           32    32    32    34    33    33    34    33    33          750
18           31    31    31    33    31    32    33    32    32          679
17           30    30    30    32    30    31    32    31    31          276
16           29    29    29    31    29    30    31    30    30          573
15           28    28    28    30    28    28    30    29    29          516
14           27    27    26    28    27    27    29    28    28          463
13           26    26    25    27    26    26    28    27    27          397
12           25    24    24    26    25    25    27    26    26          166
11           24    23    23    25    24    24    26    25    25          300
10           22    22    22    24    23    23    25    24    24          258
9            21    21    21    23    22    22    24    23    23          228
-12 to 8     20    20    20    22    21    21    23    22    22          179

Note: The frequencies shown are based on the first national administration of E7
(November 1975) except they have been divided by 10. The criterion equatings are based
on observed score equating methodology as described in the text. C1 is based on linear
observed score equating using SAT-V and SAT-M as anchors; C2 only uses SAT-V as anchor;
C3 is based on equal percentile equating.



TABLE C2. Point-by-Point Conversion Table for Form E8

                     Criterion             IRT-equating        Pre-equating
                                             Old Form            Old Form
Raw Score    C1    C2    C3    C4    E3    E4    E5    E3    E4    E5    Frequency

46 to 50     60    60    60    60    59    60    60    59    59    60          251
45           59    59    60    59    58    59    60    58    58    59          360
44           58    5?    58    58    57    57    58    57          58          463
43           57    57    57    57    56    56    57    56          57          537
42           56    56    56    55    55    55    56    55    55    56           59
41           55    55    55    54    54    54    54    54    54    ?5          604
40           53
Note: The frequencies shown are based on the first national administration of E8
(December 1975) except they have been divided by 10. The criterion equatings are based
on observed score equating methodology as described in the text. C1 is based on linear
observed score equating using SAT-V and SAT-M as anchors; C2 only uses SAT-V as anchor;
C3 is based on equal percentile equating. C4 was derived by the same procedure used for
C1 but using a different "old" sample.
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